Bayesian modeling of human–AI complementarity

Significance With the increase in artificial intelligence in real-world applications, there is interest in building hybrid systems that take both human and machine predictions into account. Previous work has shown the benefits of separately combining the predictions of diverse machine classifiers or groups of people. Using a Bayesian modeling framework, we extend these results by systematically investigating the factors that influence the performance of hybrid combinations of human and machine classifiers while taking into account the unique ways human and algorithmic confidence is expressed.


Distribution of Machine Classifier Logit Scores. Supplemental
shows the empirical distributions of the λ logit scores for correct and incorrect classes for the 16-class ImageNet dataset. The distributions are approximately normal (with some left and right skew for the incorrect and correct label distributions).

Pattern of class-specific errors by human and machine classifiers. Some of the differences between human and machine classifiers can be summarized by looking at the pattern of correct and incorrect classifications at the level of individual classes. Figure S8 shows the class-wise confusion matrices for humans and each of the four machine classifiers for the most challenging level of image distortion in the experiment. The machine classifiers are fine-tuned for one epoch. The machine classifier VGG-19, for example, makes more correct classifications for classes such as truck, dog, and bird, whereas the human makes more correct classifications for the car class. In addition, there are a number of class confusions that are more prevalent in the machine classifier relative to humans (e.g., confusing cats with dogs). These results show that human and machine classifiers make different types of errors at the class level.

To further evaluate the class-specific errors, we analyze the parameters inferred by the class-specific error model. Specifically, we assess a discrimination score d_j = (a_j − b_j)/σ for each label j. This score represents the separation between the logit scores for the correct and incorrect label, normalized by the standard deviation. It determines the ability of the classifier to discriminate between that label and all other labels, analogous to the discriminability index in signal detection theory (10). The baseline parameter b_j determines the response bias for label j. If b_j is relatively high for one particular label, the model predicts higher confidence scores and a larger number of responses (a priori) for that label. To facilitate interpretation, we report mean-centered b values (i.e., Σ_j b_j = 0).
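As a rough illustration of the two summary quantities described above, the following sketch computes the discrimination score d_j = (a_j − b_j)/σ and the mean-centered bias for each label. The function name and the numerical values are hypothetical and chosen only for illustration; they are not estimates from the paper.

```python
import numpy as np

def discrimination_and_bias(a, b, sigma):
    """Return (d, b_centered) given per-label means a_j, b_j and a shared std sigma.

    d_j = (a_j - b_j) / sigma is the discrimination score for label j.
    b_centered subtracts the mean of b so that the centered values sum to zero.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = (a - b) / sigma
    b_centered = b - b.mean()
    return d, b_centered

# Hypothetical parameter values for three labels (illustration only).
a = [2.0, 1.5, 3.0]
b = [0.5, -0.5, 0.3]
d, b_c = discrimination_and_bias(a, b, sigma=0.5)
print("discrimination:", d)
print("mean-centered bias:", b_c)
```

Applying the same function to the human and machine parameters separately and subtracting the two discrimination vectors would give a relative discrimination score per label.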
Table S1 shows the resulting estimates of discrimination and bias scores when the model extension is applied to a hybrid ensemble of a single human and the VGG-19 classifier. Across image noise levels, the VGG-19 classifier is biased toward the labels "dog" and "truck" and away from "airplane" and "knife", whereas the human participants reveal small response biases toward "boat", "car", and "dog" and away from "knife". In terms of the relative discrimination ability (i.e., d_H − d_M), the human participants are better able to detect the "car", "clock", and "knife" labels relative to VGG-19, whereas the CNN classifier is relatively good at detecting "boat" and "bird". Overall, the results show systematic differences between human and machine classifiers in terms of response biases and ability to discriminate between individual classes.

Robustness to confidence scoring. One potential contributing factor to complementarity is the difference in the type and amount of information available from machine and human classifiers. The machine classifier provides a full set of confidence scores across all classes, whereas the human classifier provides only a single confidence score (associated with the classification made for the instance). In addition, the machine classifier scores are continuous, whereas the human confidence score is discrete (three responses: "high", "medium", and "low").

To verify that our findings are robust to changes in the way confidence scores are produced, we also applied the Bayesian combination model when the machine classifier confidence scores are only observed for the winning class for each instance and the scores are discretized to three bins (analogous to the three confidence levels for the human classifiers).

The results show that hybrid pair performance with partially observed, discretized machine classifier scores is somewhat lower relative to pairs with the full machine classifier information, and fewer hybrid classifiers exceed the performance of two humans. However, the overall pattern of accuracy is qualitatively similar. Critically, the pattern of correlations is qualitatively the same: hybrid combinations of classifiers produce the lowest correlations, and machine-only combinations produce the highest correlations. As expected, posterior uncertainty for pairs of machine classifiers has increased due to the decrease in available confidence scores.
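The reduction of machine scores to a human-like format can be sketched as follows. This is only an illustration: the bin edges (1/3 and 2/3) and the function name are assumptions, since the text does not state how the three bins were defined.

```python
import numpy as np

def winning_class_bins(probs, edges=(1 / 3, 2 / 3)):
    """Keep only the winning class and a 3-level confidence per instance.

    probs: array of shape (n_instances, n_classes) with machine probabilities.
    edges: hypothetical bin edges separating low / medium / high confidence.
    Returns (winner, level) where level is 0 (low), 1 (medium), or 2 (high).
    """
    probs = np.asarray(probs, dtype=float)
    winner = probs.argmax(axis=1)
    top = probs[np.arange(probs.shape[0]), winner]
    level = np.digitize(top, edges)
    return winner, level

# Hypothetical probabilities for three instances and three classes.
probs = [[0.90, 0.05, 0.05],
         [0.40, 0.35, 0.25],
         [0.20, 0.30, 0.50]]
winner, level = winning_class_bins(probs)
print(winner, level)
```

After this step each machine prediction carries the same kind of information as a human response: one label and one of three confidence levels.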

Therefore, these results show that human-machine classifier complementarity is in part influenced by the type of confidence scores available. Having a full set of continuous confidence scores contributes to improved pair performance.

(iv) The marginal distribution over classes is uniform, i.e., the marginal probability of seeing class i is p(z = i) = 1/L for every class i.

Assumption (ii) implies that H1 and H2 are exchangeable under our Bayesian model (and, likewise, assumption (iii) implies that M1 and M2 are exchangeable). Without loss of generality, we can additionally assume that the true class is the first one. We use ρ_HH and ρ_MM to denote the correlation parameters between the two human labelers and between the two model labelers, respectively, and ρ_HM to denote the correlation between any human and any machine classifier.

An illustrative special case: L = 2 classes. To demonstrate our analysis, we begin with the special case of binary (L = 2) classification.

For an individual classifier C, there are two logit scores sampled in the model, λ1 ∼ N(a, σ) and λ2 ∼ N(0, σ), associated with the correct and incorrect class respectively. The accuracy for this classifier conditional on model parameters is

where φ(λ|µ, σ) is the normal density for λ given mean µ and standard deviation σ, and Φ denotes the cumulative distribution function of the standard normal distribution. Hence, for two individual classifiers, C1 and C2, we will have

For two classifiers, we have two pairs of logit scores. For the correct and incorrect class, the pairs of logit scores are sampled from the bivariate normals,

The accuracy for the combined classifier is then

In order to facilitate the study of complementarity, we specialize the above results to the case of homogeneous and heterogeneous pairs.
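For the single-classifier binary case, the accuracy can be worked out directly from the stated sampling assumptions. The derivation below is a reconstruction under the assumption that λ1 and λ2 are independent for a single classifier and that the second argument of N(·, ·) is a standard deviation; it is not necessarily the exact form of Eq. (3).

```latex
\begin{align}
\lambda_1 - \lambda_2 &\sim \mathcal{N}\!\left(a,\ \sqrt{2}\,\sigma\right), \\
A_C \;=\; P(\lambda_1 > \lambda_2)
  &= \int_{-\infty}^{\infty} \phi(\lambda \mid a, \sigma)\,
     \Phi\!\left(\frac{\lambda}{\sigma}\right) d\lambda
   \;=\; \Phi\!\left(\frac{a}{\sqrt{2}\,\sigma}\right).
\end{align}
```

The first line uses the fact that the difference of two independent normal variables is normal with summed variances; the integral form conditions on the correct-class score λ1 = λ and asks for the probability that the incorrect-class score falls below it.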

For the sake of simplicity, we describe the homogeneous analysis in the case of a pair of two humans, H1 and H2. The analysis can be translated to that of homogeneous pairs of models by making the necessary changes in notation.

In this case, under the set of assumptions outlined above, by simplifying Eq. (4) we can express the pair accuracy

where ρ_HH ∈ [0, 1] is the human-human correlation. As we would expect, as ρ_HH increases, the accuracy of the pair will decrease.

To illustrate these results, we compare to the accuracy of a single human. By Eq. (3) and Eq. (4), we have

Note that ρ_HH ≤ 1, so that this inequality will always hold, i.e., under our assumptions the pair of two humans will always have a higher accuracy than a single human.
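The comparison between a single human and a pair of humans can be checked numerically. The sketch below assumes that the pair decides by summing the two logit scores per class, which for an exchangeable pair with equal means and variances coincides with the likelihood-ratio rule; under that assumption the pair accuracy works out to Φ(a / (σ√(1 + ρ_HH))) and the single accuracy to Φ(a / (σ√2)). The function names and parameter values are hypothetical.

```python
import math
import numpy as np

def phi_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def single_accuracy(a, sigma):
    """Binary accuracy of one classifier: P(lambda_1 > lambda_2)."""
    return phi_cdf(a / (sigma * math.sqrt(2.0)))

def pair_accuracy_sum_rule(a, sigma, rho):
    """Binary accuracy of an exchangeable pair that sums its logit scores."""
    return phi_cdf(a / (sigma * math.sqrt(1.0 + rho)))

def simulate_pair(a, sigma, rho, n=200_000, seed=0):
    """Monte Carlo estimate of the pair accuracy under the same sum rule."""
    rng = np.random.default_rng(seed)
    cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
    correct = rng.multivariate_normal([a, a], cov, size=n)      # correct-class scores
    incorrect = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # incorrect-class scores
    return float(np.mean(correct.sum(axis=1) > incorrect.sum(axis=1)))

# Hypothetical parameter values (illustration only).
a, sigma, rho = 1.0, 1.0, 0.5
print("single (closed form):", single_accuracy(a, sigma))
print("pair   (closed form):", pair_accuracy_sum_rule(a, sigma, rho))
print("pair   (simulated)  :", simulate_pair(a, sigma, rho))
```

Because 1 + ρ_HH ≤ 2, the argument of Φ for the pair is never smaller than that for a single classifier, which is the inequality discussed above.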

In the case of a heterogeneous pair consisting of a human labeler and model labeler, assume further that the models and humans have the same variance, i.e., σ := σ_H = σ_M. We can express the heterogeneous human-model pair accuracy A_HM as

Conditions for complementarity. We derive a necessary and sufficient condition on the correlation ρ_HM to achieve complementarity. Since Φ(·) is a strictly increasing function, it suffices to compare the arguments of Equation

This is quadratic in ρ_HM, allowing us to solve for the conditions on ρ_HM that will lead to complementarity.

mean that λ1 > λj for j = 2, 3, . . . , L, i.e., the score for the correct label is greater than the score for every other class.
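For the binary heterogeneous case discussed above, one way to explore the condition on ρ_HM numerically is sketched below. It assumes the pair is combined with the likelihood-ratio (Bayes-optimal linear) rule implied by the bivariate normal model with shared σ, which gives A_HM = Φ(√((a_H² + a_M² − 2 ρ_HM a_H a_M) / (2σ²(1 − ρ_HM²)))). This expression is a reconstruction from the stated assumptions rather than a quotation of the paper's equation, and all function names and values are hypothetical.

```python
import math

def phi_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def homogeneous_pair_accuracy(a, sigma, rho):
    """Pair of exchangeable classifiers with mean a, std sigma, correlation rho."""
    return phi_cdf(a / (sigma * math.sqrt(1.0 + rho)))

def hybrid_pair_accuracy(a_h, a_m, sigma, rho_hm):
    """Human-machine pair with shared sigma; requires |rho_hm| < 1."""
    q = (a_h**2 + a_m**2 - 2.0 * rho_hm * a_h * a_m) / (
        2.0 * sigma**2 * (1.0 - rho_hm**2)
    )
    return phi_cdf(math.sqrt(q))

def is_complementary(a_h, a_m, sigma, rho_hh, rho_mm, rho_hm):
    """True when the hybrid pair beats both homogeneous pairs."""
    a_hm = hybrid_pair_accuracy(a_h, a_m, sigma, rho_hm)
    a_hh = homogeneous_pair_accuracy(a_h, sigma, rho_hh)
    a_mm = homogeneous_pair_accuracy(a_m, sigma, rho_mm)
    return a_hm > max(a_hh, a_mm)

# Hypothetical values: equally accurate human and machine, low cross-correlation.
print(is_complementary(1.0, 1.0, 1.0, rho_hh=0.6, rho_mm=0.8, rho_hm=0.2))
```

Comparing the squared arguments of Φ for the hybrid and human-human pairs and clearing denominators yields a polynomial of degree two in ρ_HM, consistent with the quadratic condition described in the text.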
We can perform a similar analysis for the pair of C1 and C2:

We can use the above integral forms to derive an if and only if condition for complementarity. Let r_{C1,C2} be the ratio that appears in the argument of φ(·) in Eq. (10):

Under our assumptions, this ratio can be simplified to a more interpretable form in the hybrid and non-hybrid cases:

Proof. We prove that r_HM > r_HH is sufficient for A_HM > A_HH. The proof for the pair of two models is analogous, and so r_HM > max{r_HH, r_MM} will satisfy A_HM > A_HH and A_HM > A_MM simultaneously. The same argument also works (with minor modifications) to prove the "only if" part of the statement.

Set ∆_H = r_HM − r_HH. We have ∆_H > 0 by assumption. We can evaluate the accuracies with the above formulae and use a change of variables to prove the claim:

Fig. S1. Illustration of the generative process of the Bayesian model that produces the classification and confidence scores for a single human (H) and machine classifier (M). In the example, there are three classes and the ground truth (z) for a particular image is "Bear". The ground truth selects for each label a bivariate normal distribution with means (a_M, a_H) shown in green or (b_M, b_H) shown in red when the ground truth matches or mismatches the label, respectively. A single sample (white circle) is taken from each selected bivariate normal distribution to produce the correlated logit scores (λ) for the human and machine classifier. The separation of the means between the matching and mismatching distributions (a − b) determines the discrimination ability of the classifier for that class, whereas the mean of the mismatching distribution (b) determines the response bias for that class. In this example, the human classifier has a response bias for "Dog". For the machine classifier, the logit scores are transformed to observable probabilities (γ). For the human, a softmax is applied to the latent confidence scores (γ) to determine the classification (dog) and an ordinal probit model is used to sample the observed confidence rating ("Medium").
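The generative process summarized in the Fig. S1 caption can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: taking the human classification as the argmax of the softmax output, the probit noise scale `noise_sd`, the cutpoint values, and all parameter values are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_instance(z, a_m, a_h, b_m, b_h, sigma_m, sigma_h, rho,
                    cutpoints, rng, noise_sd=0.1):
    """Sample machine probabilities, a human label, and a human confidence rating.

    z: index of the ground-truth label.
    a_*, b_*: per-label means for matching / mismatching labels.
    cutpoints: increasing thresholds of the (assumed) ordinal probit model.
    """
    n_labels = len(a_m)
    cov = np.array([[sigma_m**2, rho * sigma_m * sigma_h],
                    [rho * sigma_m * sigma_h, sigma_h**2]])
    logits = np.empty((n_labels, 2))  # column 0: machine, column 1: human
    for j in range(n_labels):
        mean = (a_m[j], a_h[j]) if j == z else (b_m[j], b_h[j])
        logits[j] = rng.multivariate_normal(mean, cov)
    machine_probs = softmax(logits[:, 0])   # observable machine probabilities
    human_conf = softmax(logits[:, 1])      # latent human confidence scores
    human_label = int(np.argmax(human_conf))
    # Ordinal probit step (assumed form): noisy latent score compared to cutpoints.
    latent = human_conf[human_label] + rng.normal(0.0, noise_sd)
    rating = int(np.searchsorted(cutpoints, latent))  # 0=low, 1=medium, 2=high
    return machine_probs, human_label, rating

# Hypothetical parameters for three labels (illustration only).
rng = np.random.default_rng(1)
machine_probs, human_label, rating = sample_instance(
    z=0,
    a_m=[2.0, 2.0, 2.0], a_h=[1.5, 1.5, 1.5],
    b_m=[0.0, 0.0, 0.0], b_h=[0.0, 0.5, 0.0],  # human bias toward label 1
    sigma_m=1.0, sigma_h=1.0, rho=0.3,
    cutpoints=[0.4, 0.7], rng=rng,
)
print(machine_probs, human_label, rating)
```

Raising one entry of `b_h` relative to the others, as in the example call, mimics the response bias toward a single label that the caption describes for the human classifier.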